Text analysis meets corpus linguistics

Authors

  • Hannah Kermes
  • Stefan Evert
Abstract

In recent years, there has been rising interest in using evidence derived from automatic syntactic analysis in large-scale corpus studies. Ideally, of course, corpus linguists would prefer to have access to the wealth of structural and featural information provided by a full parser based on a complex grammar formalism. However, to date such parsers achieve neither the speed nor the robustness needed to process hundreds of millions of words. Besides this practical limitation, there are at least two more fundamental problems with this approach. Firstly, complex grammars tend to produce highly ambiguous output. Without extensive lexical and semantic knowledge, there will often be thousands of different analyses for any given sentence. Secondly, full parsers usually embrace a particular theoretical perspective, embodied in the grammar formalism they use. If the researcher’s perspective on syntax is different from that of the parser, he or she will find it difficult, if not outright impossible, to apply the parser’s analyses to the research question.

Similar resources

Computer Assisted Legal Linguistics (CAL2)

We introduce Computer Assisted Legal Linguistics (CAL2) as a semiautomated method to “make sense” of legal discourse by systematically analyzing large collections of legal texts. Such digital corpora have been increasingly used in computational linguistics in recent years, as part of a quantitative research strategy designed to complement (rather than supplant) the more qualitative methods used...

Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

The main task of tokenization is to divide the sentences of a text into their constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical writing chain that is an independent semantic unit. Tokenization occurs at the word level, and the extracted units can be used as input to other components such as a stemmer. The requirement to create...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Emdros - a text database engine for analyzed or annotated text

Emdros is a text database engine for linguistic analysis or annotation of text. It is applicable especially in corpus linguistics for storing and retrieving linguistic analyses of text, at any linguistic level. Emdros implements the EMdF text database model and the MQL query language. In this paper, I present both, and give an example of how Emdros can be useful in computational linguistics.

Roland Bluhm

The aim of this paper is to discuss the potential benefit of corpus analysis, a (partly) empirical method from linguistics, for philosophy. ‘Corpus analysis’ is not only the name of the method, but also a rough description of it, because the method consists in analysing data taken from linguistic text corpora. In linguistics, using such text corpora is an established practice. A fair number of ...

Publication date: 2003